Abstract


We introduce an approach to train lexicalized parsers using bilingual
corpora obtained by merging harmonized treebanks of different languages,
producing parsers that can analyze sentences in either of the learned
languages, or even sentences that mix both. We test the approach on the
Universal Dependency Treebanks, training with MaltParser and
MaltOptimizer. The results show that these bilingual parsers are more
than competitive, as most combinations not only preserve accuracy, but
some even achieve significant improvements over the corresponding
monolingual parsers. Preliminary experiments also show the approach to
be promising on texts with code-switching and when more languages are
added.

1 Introduction 


The need for frameworks for analyzing content in different languages has
been discussed recently (Dang et al., 2014), and multilingual dependency
parsing is no stranger to this challenge. Data-driven parsing models
(Nivre, 2006) can be trained for any language, given enough annotated
data.

For languages where treebanks are not available, cross-lingual transfer
can be used to train parsers for a target language with data from one or
more source languages. Data transfer approaches (e.g. Yarowsky et al.
(2001), Tiedemann (2014)) map linguistic annotations across languages
through parallel corpora. Model transfer approaches (e.g. Naseem et al.
(2012)) instead rely on cross-linguistic syntactic regularities to learn
aspects of the source language that help parse an unseen language,
without parallel corpora.


Model transfer approaches have benefitted from the development of
multilingual resources that harmonize annotations. Petrov et al. (2011)
proposed a universal tagset, and McDonald et al. (2011) employed it to
transfer delexicalized parsers (Zeman and Resnik, 2008). More recently,
several projects have presented treebank collections of multiple
languages with their annotations standardized at the syntactic level,
including HamleDT (Zeman et al., 2012) and the Universal Dependency
Treebanks (McDonald et al., 2013).

In this paper we also rely on these resources, but with a different
goal: we use universal annotations to train bilingual dependency parsers
that effectively analyze unseen sentences in any of the learned
languages. Unlike delexicalized approaches for model transfer, our
parsers exploit lexical features. The results are encouraging: our
experiments show that, starting with a monolingual parser, we can
"teach" it an additional language for free in terms of accuracy (i.e.,
without significant accuracy loss on the original language, in spite of
learning a more complex task) in the vast majority of cases.


2 Bilingual training 

Universal Dependency Treebanks v2.0 (McDonald et al., 2013) is a set of
CoNLL-formatted treebanks for ten languages, annotated with common
criteria. They include two versions of PoS tags: universal tags (Petrov
et al., 2011) in the CPOSTAG column, and a refined annotation with
treebank-specific information in the POSTAG column. Some of the latter
tags are not part of the core universal set, and they can denote
linguistic phenomena that are language-specific, or phenomena that not
all the corpora have annotated in the same way.

To train monolingual parsers (our baseline), we used the official
training-dev-test splits provided with the corpora. For the bilingual
models, for each pair of languages $L_1$, $L_2$, we simply merged their
training sets into a single file acting as a training set for
$L_1 \cup L_2$, and we did the same for the development sets. The test
sets were not merged because comparing the bilingual parsers to
monolingual ones requires evaluating each bilingual parser on the two
corresponding monolingual test sets.
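
The merging step can be illustrated with a short sketch. This is not the authors' script: the function names `read_sentences` and `merge_treebanks` are hypothetical, and the code only assumes the usual CoNLL convention that a sentence is a block of token lines terminated by a blank line.

```python
def read_sentences(path):
    """Yield each sentence of a CoNLL file as a list of token lines."""
    sentence = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.strip():
                sentence.append(line)
            elif sentence:
                # A blank line closes the current sentence.
                yield sentence
                sentence = []
    if sentence:
        # The file did not end with a blank line.
        yield sentence


def merge_treebanks(paths, out_path):
    """Concatenate several CoNLL treebanks into one training file.

    Returns the number of sentences written.
    """
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            for sentence in read_sentences(path):
                out.write("\n".join(sentence) + "\n\n")
                count += 1
    return count
```

Under these assumptions, a call such as `merge_treebanks(["en-train.conll", "es-train.conll"], "en-es-train.conll")` (file names invented for illustration) would produce the training file for one language pair; the same call on the two development sets gives the merged development file.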

To build the models, we relied on MaltParser (Nivre et al., 2007). Due
to the large number of language pairs that complicates manual
optimization, and to ensure a fair comparison, we used MaltOptimizer
(Ballesteros and Nivre, 2012), an automatic optimizer for MaltParser
models. This system works in three phases: Phases 1 and 2 choose a
parsing algorithm by analyzing the training set and performing
experiments with default features. Phase 3 tunes the feature model and
algorithm parameters. We hypothesize that the bilingual models will
learn a set of features that fits both languages, and check this
hypothesis by evaluating on the test sets.

We propose two training configurations: (1) a treebank-dependent tags
configuration, where we include the information in the POSTAG column,
and (2) a universal tags only configuration, where we do not use this
information, relying only on the CPOSTAG column. Information that could
be present in the FEATS or LEMMA columns is not used in any case. This
methodology aims to answer two research questions: (1) can we train
bilingual parsers with good accuracy by merging harmonized training
sets?, and (2) is it essential that the tagsets for both languages are
the same, or can we still get accuracy gains from fine-grained PoS tags
(as in the monolingual case) even if some of them are treebank-specific?
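
As an illustration of the two configurations, the sketch below rewrites one CoNLL-X token line. How the POSTAG information was actually withheld from the parser is not stated in the text; overwriting POSTAG with CPOSTAG is an assumption made here for the example, and the function name `configure_token` is hypothetical. The column positions follow the CoNLL-X layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, ...).

```python
# 0-based column positions in the CoNLL-X format.
LEMMA, CPOSTAG, POSTAG, FEATS = 2, 3, 4, 5


def configure_token(line, universal_only):
    """Return a token line prepared for one of the two configurations.

    LEMMA and FEATS are always blanked, since neither configuration uses
    them. With universal_only=True the fine-grained POSTAG is replaced by
    the universal CPOSTAG (an assumed way of discarding it).
    """
    cols = line.split("\t")
    cols[LEMMA] = "_"
    cols[FEATS] = "_"
    if universal_only:
        cols[POSTAG] = cols[CPOSTAG]
    return "\t".join(cols)


token = "1\tdogs\tdog\tNOUN\tNNS\tNum=Pl\t2\tnsubj\t_\t_"
treebank_dependent = configure_token(token, universal_only=False)
universal = configure_token(token, universal_only=True)
```

Here `treebank_dependent` keeps `NNS` in the POSTAG column, while `universal` carries `NOUN` in both tag columns.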

All models are freely available.¹


3 Evaluation 


To ensure a fair comparison between monolingual and bilingual models, we
chose to optimize the models from scratch with MaltOptimizer, expecting
it to choose the parsing algorithm and feature model most likely to
obtain good results. We observed that the selection of a bilingual
parsing algorithm was not necessarily related to the algorithms selected
for the monolingual models. The system sometimes chose an algorithm for
a bilingual model that was not selected for any of the corresponding
monolingual models.

¹ http://grupolys.org/software/PARSERS/

In view of this, and as it is known that different parsing algorithms
can be more or less competitive depending on the language (Nivre, 2008),
we ran a control experiment to evaluate the models setting the same
parsing algorithm for all cases, executing only phase 3 of
MaltOptimizer. We chose the arc-eager parser for this experiment, as it
was the algorithm that MaltOptimizer chose most frequently for the
monolingual models in the previous configuration. The aim was to compare
the accuracy of the bilingual models with respect to the monolingual
ones when there is no variation in the parsing algorithm between them.
The results of this control experiment are not shown for space reasons,
but they were very similar to those of the original experiment.


3.1 Results on the Universal Treebanks 

Table 1 compares the accuracy of bilingual models to that of monolingual
ones, under the treebank-dependent tags configuration. Each table cell
shows the accuracy of a model, in terms of LAS and UAS. Cells in the
diagonal correspond to monolingual models (the baseline), with the cell
located at row $i$ and column $i$ representing the result obtained by
training a monolingual parser on the training set of language $L_i$, and
evaluating it on the test set of the same language $L_i$. Each cell
outside the diagonal (at row $i$ and column $j$, with $j \neq i$) shows
the results of training a bilingual model on the training set for
$L_i \cup L_j$, evaluated on the test set of $L_i$.
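
For readers unfamiliar with the two metrics, the following sketch computes them for a list of tokens. UAS (unlabeled attachment score) is the percentage of tokens whose predicted head is correct, and LAS (labeled attachment score) additionally requires the dependency label to match. The function name is hypothetical, and the sketch scores every token; whether punctuation was excluded in the reported figures is not stated in the text.

```python
def attachment_scores(gold, predicted):
    """Return (LAS, UAS) as percentages.

    gold and predicted are equal-length lists of (head, deprel) pairs,
    one pair per token, in the same token order.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot score an empty token list")
    # UAS: only the head index has to match.
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    # LAS: head index and dependency label both have to match.
    las_hits = sum(g == p for g, p in zip(gold, predicted))
    total = len(gold)
    return 100.0 * las_hits / total, 100.0 * uas_hits / total


gold = [(2, "nsubj"), (0, "root"), (2, "dobj"), (2, "p")]
pred = [(2, "nsubj"), (0, "root"), (2, "iobj"), (3, "p")]
las, uas = attachment_scores(gold, pred)  # las == 50.0, uas == 75.0
```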

As we can see, in a large majority of cases, bilingual parsers learn to
parse two languages with no statistically significant accuracy loss with
respect to the corresponding monolingual parsers (p < 0.05 with Bikel's
randomized parsing evaluation comparator). This happened in 74 out of 90
cases when measuring UAS, or 69 out of 90 in terms of LAS. Therefore, in
most cases where we are applying a parser to texts in a given language,
adding a second language comes for free in terms of accuracy.
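
Bikel's comparator is an external tool and is not reproduced here. The sketch below is a generic paired randomization (approximate permutation) test in the same spirit, written under the assumption that per-sentence counts of correctly attached tokens are available for both parsers on the same test set. The function name, the default number of iterations and the seed are all illustrative choices.

```python
import random


def randomization_test(correct_a, correct_b, iterations=10000, seed=0):
    """Two-sided paired randomization test on per-sentence scores.

    correct_a[i] and correct_b[i] are the numbers of correctly attached
    tokens in sentence i for parsers A and B. Because both parsers see
    the same tokens, a difference in totals is a difference in accuracy.
    Returns an estimated p-value.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both parsers must be scored on the same sentences")
    rng = random.Random(seed)
    observed = abs(sum(correct_a) - sum(correct_b))
    at_least_as_extreme = 0
    for _ in range(iterations):
        diff = 0
        for a, b in zip(correct_a, correct_b):
            # Under the null hypothesis the two outputs are exchangeable,
            # so swap them for this sentence with probability 0.5.
            if rng.random() < 0.5:
                a, b = b, a
            diff += a - b
        if abs(diff) >= observed:
            at_least_as_extreme += 1
    # Add-one smoothing keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (iterations + 1)
```

A difference would then be called significant when the returned value falls below the chosen threshold (0.05 in the text).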

Table 1: Performance on the Universal Dependency Treebanks test sets
using the gold POSTAG information. For each cell, its (row, column) pair
indicates the language(s) with which the model was trained, with the row
corresponding to the language where it was evaluated. '++' and '+'
indicate that the improvement in performance obtained by the bilingual
model is statistically significant or not, respectively. '--' and '-'
correspond to significant and not significant decreases in accuracy.

Table 2: Performance on the Universal Dependency Treebanks test sets
using the gold CPOSTAG information. The table is laid out like Table 1.

More strikingly, there are many cases where bilingual parsers outperform
monolingual ones, even in this evaluation on purely monolingual
datasets. In particular, there are 12 cases where a bilingual parser
obtains statistically significant gains in LAS over the monolingual
baseline, and 9 cases with significant gains in UAS. This clearly
surpasses the number of significant gains to be expected by chance, and
applying the Benjamini-Hochberg procedure (Benjamini and Hochberg, 1995)
to correct for multiple comparisons with a maximum false discovery rate
of 20% yields 8 significant improvements in LAS and UAS. Therefore, it
is clear that there is synergy between datasets: in some cases, adding
annotated data in a different language to our training set can actually
improve the accuracy that we obtain in the original language. This opens
up interesting research potential in using confidence criteria to select
the data that can help parsing in this way, akin to what is done in
self-training approaches (Chen et al., 2008; Goutam and Ambati, 2011).
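
The Benjamini-Hochberg correction mentioned above can be sketched in a few lines. Given $m$ p-values sorted in increasing order, $p_{(1)} \le \dots \le p_{(m)}$, the procedure finds the largest rank $k$ with $p_{(k)} \le \frac{k}{m} q$ and declares the $k$ smallest p-values significant, where $q$ is the maximum false discovery rate. The function name is hypothetical and the p-values in the example are invented for illustration; they are not the paper's data.

```python
def benjamini_hochberg(p_values, fdr=0.20):
    """Return a list of booleans: True where the hypothesis is rejected.

    Implements the Benjamini-Hochberg step-up procedure at the given
    maximum false discovery rate.
    """
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            # Keep the largest rank that satisfies the bound.
            cutoff = rank
    rejected = [False] * m
    for idx in order[:cutoff]:
        rejected[idx] = True
    return rejected


example = benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.9], fdr=0.05)
```

In the setting of the text, `p_values` would hold the 90 p-values of the bilingual versus monolingual comparisons and `fdr` would be 0.20.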


Comparing the results by language, we note that the accuracy on the
English and Spanish datasets almost always improves when adding a second
treebank for training. Other languages that tend to get improvements in
this way are French and Portuguese. There seems to be a rough trend
towards the languages with the largest training corpora benefiting from
adding a second language, and those with the smallest corpora (e.g.
Indonesian, Italian or Japanese) suffering accuracy loss, likely because
the training gets biased towards the second language.

Training bilingual models containing a significant number of
non-overlapping treebank-dependent tags tends to have a positive effect.
English and Spanish are two of the clearest examples of this. Table 3
gives a complete report of shared PoS tags for each pair of languages
under the treebank-dependent tags configuration. English only shares 1
PoS tag with the rest of the corpora under this configuration, except
for Swedish, with up to 5 tags in common; and the en-sv model is the
only one suffering a significant loss on the English test set. Similar
behavior is observed on Spanish: sv (0), en (1), ja (10) and ko (12) are
the four languages with the fewest shared PoS tags, and those are the
four that obtained a significant improvement on the Spanish evaluation;
while with pt-br, with 15 shared PoS tags, we lose accuracy. The
validity of this hypothesis is reinforced by an experiment where we
differentiate the universal tags by language by appending a language
code to them (e.g. EN_NOUN for an English noun). An overall improvement
was observed with respect to the bilingual parsers with non-disjoint
sets of features.
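
The two operations discussed in this paragraph, counting the tags shared by two treebanks and making tagsets disjoint with a language code, are simple enough to sketch. The function names and the toy tag lists are hypothetical; only the `EN_NOUN` naming pattern comes from the text.

```python
def shared_tags(tags_a, tags_b):
    """Return the set of PoS tags that occur in both treebanks."""
    return set(tags_a) & set(tags_b)


def add_language_code(tags, lang):
    """Differentiate tags by language, e.g. NOUN -> EN_NOUN for English."""
    return [f"{lang.upper()}_{tag}" for tag in tags]


# Toy tag inventories, invented for illustration.
en_tags = ["NOUN", "VERB", "NN"]
es_tags = ["NOUN", "VERB", "nc"]

before = shared_tags(en_tags, es_tags)
after = shared_tags(add_language_code(en_tags, "en"),
                    add_language_code(es_tags, "es"))
```

With these toy lists, `before` contains the two tags common to both inventories, and `after` is empty, since every tag now carries its language code.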
Table 3: Shared language-specific tags between pairs of languages

While all these experiments have been performed on sentences with gold
PoS tags, preliminary experiments assuming predicted tags instead show
analogous results: the absolute values of LAS and UAS are slightly
smaller across the board, but the behavior in relative terms is the
same, and the bilingual models that improved over the monolingual
baseline in the gold experiments keep doing so under this setting.

On the other hand, Table 2 shows the performance of the monolingual and
bilingual models under the universal tags only configuration. The
bilingual parsers are also able to keep an acceptable accuracy with
respect to the monolingual models, but significant losses are much more
prevalent than under the treebank-dependent tags configuration.

Putting both tables together, our experiments clearly suggest that
treebank-specific tags not only do not impair the training of bilingual
models, but are even beneficial, supporting the idea that using
partially treebank-dependent tagsets helps multilingual parsing. We
hypothesize that this may be because complementing the universal
information at the syntactic level with language-specific information at
the lower levels (lexical and morphological) may help the parser
identify specific constructions of one language that would not benefit
from the knowledge learned from the other, preventing it from trying to
exploit spurious similarities between languages. This explanation is
consistent with work on delexicalized parser transfer (Lynn et al.,
2014) showing that better results can be obtained using disparate
languages than closely-related languages, as long as they have common
syntactic constructions. Thus, using universal PoS tags to train
multilingual parsers can be, surprisingly, counterproductive.


3.2 Parsing code-switched sentences 

Our bilingual parsers also show robustness on texts exhibiting
code-switching. Unfortunately, there are no syntactically annotated
code-switching corpora, so we could not perform a formal evaluation. We
did perform informal tests, by running the Spanish-English bilingual
parsers on some such sentences. We observed that they were able to parse
the English and Spanish parts of the sentences much better than
monolingual models. This required training a bilingual tagger, which we
did with the free distribution of the Stanford tagger (Toutanova and
Manning, 2000), merging the Spanish and English corpora to train a
combined bilingual tagger. Under the universal tags only configuration,
the multilingual tagger obtained 98.00% and 95.88% over the monolingual
test sets. Using treebank-dependent tags instead, it obtained 97.19% and
93.88% over the monolingual test sets. Figure 1 shows an interesting
example of how using bilingual parsers (and taggers) affects the parsing
accuracy.

Table 4 shows the performance on a tiny code-switching treebank built on
top of ten normalized tweets.² This confirms that monolingual pipelines
perform poorly. Using a bilingual tagger helps improve the performance,
thanks to accurate tags for both languages, but a bilingual parser is
needed to push both LAS and UAS up to state-of-the-art levels.

² The code-switching treebank follows the Universal Treebank v2.0
annotations. It can be obtained by asking any of the authors.

[Figure 1 panels, each showing a parse of "We are working hard on putting
available los mejores productos de Espana , thank you": a) es tagger, es
parser; b) en tagger, en parser; c) en-es tagger, en-es parser]

Figure 1: Example with the en, es, en-es models. Dotted lines represent
incorrectly-parsed dependencies. The corresponding English sentence is:
'We are working hard on putting available the best products of Spain,
thank you'

Tagger   Parser   LAS     UAS
en       en       37.82   44.23
es       es       27.56   41.03
en-es    en
en-es
en-es

Table 4: Performance on a code-switching treebank composed of 10
sentences.


3.3 Adding more languages

To show that our approach works when more languages are added, we
created a quadrilingual parser using the Romance languages and the
fine-grained PoS tag set. The results (LAS/UAS) on the monolingual sets
were: 80.18/84.64 (es), 79.11/84.29 (fr), 82.16/86.15 (it) and
84.45/86.80 (pt). In all cases, the performance is almost equivalent to
the monolingual parser.

Noah's ARK group (Ammar et al., 2016) has shown that this idea can also
be adapted to universal parsing. Our models are a collection of weights
learned from mixing harmonized treebanks, which accurately analyze
sentences in any of the learned languages and where it is possible to
take advantage of linguistic universals, but they are still dependent on
language-specific word forms. Instead, Ammar et al. (2016) rely on
multilingual word clusters and multilingual word embeddings, learning a
universal representation. They also support incorporating
language-specific information (e.g. PoS tags) to keep learning
language-specific behavior. To address syntactic differences between
languages (e.g. noun-adjective vs adjective-noun structure) they can
inform the parser about the input language.


4 Conclusions and future work

To our knowledge, this is the first attempt to train purely bilingual
parsers to analyze sentences irrespective of which of the two languages
they are written in, as existing work on training a parser on two
languages (Smith and Smith, 2004) focused on using parallel corpora to
transfer linguistic knowledge between languages.

Our results reflect that bilingual parsers do not lose accuracy with
respect to monolingual parsers on their corresponding language, and can
even outperform them, especially if fine-grained tags are used. This
shows that, thanks to universal dependencies and shared syntactic
structures across different languages, using treebank-dependent tag sets
is not a drawback, but even an advantage.

The applications include parsing sentences of different languages with a
single model, improving the accuracy of monolingual parsing with
training sets from other languages, and successfully parsing sentences
exhibiting code-switching.

As future work, our approach could benefit from simple domain adaptation
techniques (Daumé III, 2009), to enrich the training set for a target
language by incorporating data from a source language.